Retry on leader lease renewal failure #9563
Conversation
Visit the preview URL for this PR (updated for commit 7cf3c2b): https://gloo-edge--pr9563-dont-crash-on-failed-6s4x9piv.web.app (expires Tue, 25 Jun 2024 00:48:42 GMT) 🔥 via Firebase Hosting GitHub Action 🌎 Sign: 77c2b86e287749579b7ff9cadb81e099042ef677
thanks @davidjumani - i've read it over a few times and have a few questions. i think a little more documentation (especially around the go funcs) would be helpful. also, a major question regarding whether we still have the regular behavior if we're not configured to recover from failure.
happy to talk in person if that would be easier
ps: i haven't had a chance to look at the test yet
looking good! I mainly focused on how we're testing for now, and once that is updated, I'm happy to take a closer look at the implementation itself
looks good to me. one question on the comment from Sam regarding whether the goroutine is safe. if so, happy to ignore the nits to approve
kick bulldozer
* retry on leader lease renewal failure
* Die if unable to recover
* use env var
* add basic tests
* update tests
* add comments
* Add comments around ci changes
* update tests
* refactor
* cleanup
* update test
* rename env var
* add changelog
* address comments v1
* address comments v2
* fix test
* use GinkgoHelper
* Adding changelog file to new location
* Deleting changelog file from old location
* specify a duration
* Adding changelog file to new location
* Deleting changelog file from old location
* remove default
* Adding changelog file to new location
* Deleting changelog file from old location
* fix tests
* move changelog
* move counter

---------

Co-authored-by: soloio-bulldozer[bot] <48420018+soloio-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: changelog-bot <changelog-bot>
Co-authored-by: Nathan Fudenberg <nathan.fudenberg@solo.io>
Description
After a container has become the leader, any Kube API server unavailability causes the container to crash, since the leader is unable to renew its lease. This is by design, as outlined here.
However, crashing the gloo pods can also lead to an outage during scaling, since the gateway-proxy pods that come up cannot fetch any config and all routes return 404s.
Now, instead of crashing, the gloo pod falls back to being a follower. This prevents an outage, and any other pod can take over as leader if possible.
This is safe because a leader only writes reports / statuses here and here. On any failure the pod becomes a follower, and if elected leader again it will resume writing reports.
Code changes
Adds a `Reset` method on the identity implementation that allows an identity to fall back to a follower

CI changes
Cilium is installed as the CNI instead, since we need to test Kube API server unavailability
Context
Kube API unavailability results in a gloo container crash
When leader election fails, gloo crashes
Design Doc
Interesting decisions
Testing steps
Kube2e tests to verify the following:
Notes for reviewers
Be sure to verify intended behavior by ...
Please proofread comments on ...
This is a complex PR and may require a huddle to discuss ...
Checklist:
BOT NOTES:
resolves #8107